knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
library(tidyverse)
library(rstan)
library(brms)
library(furrr)
library(modelr)
library(tidybayes)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
# theme_set(
#   theme_minimal() +
#     theme(
#       axis.text = element_text(size = 12),
#       axis.title = element_text(size = 14),
#       axis.text.y = element_blank(),
#       axis.title.y = element_blank(),
#       strip.text = element_text(size = 14),
#       panel.spacing = unit(4, "lines")
#     )
# )
We tally the false positives by condition. The graphs below show the distribution of the number of false positives in a single trial. We can see that most people make 2 or fewer false positives in each trial; however, we do not see much difference based on the number of regions shown to participants.
## # A tibble: 12 × 7
## condition nregions tp tn fp fn fdr
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ci 8 0.273 0.417 0.0795 0.230 0.225
## 2 ci 12 0.267 0.432 0.0588 0.243 0.181
## 3 dotplot 8 0.325 0.406 0.0899 0.179 0.217
## 4 dotplot 12 0.311 0.427 0.0635 0.198 0.169
## 5 halfeye 8 0.295 0.423 0.0736 0.209 0.200
## 6 halfeye 12 0.279 0.442 0.0481 0.230 0.147
## 7 hops_bootstrap 8 0.184 0.439 0.0580 0.319 0.240
## 8 hops_bootstrap 12 0.177 0.445 0.0457 0.332 0.205
## 9 hops_mean 8 0.277 0.438 0.0579 0.227 0.173
## 10 hops_mean 12 0.275 0.449 0.0414 0.235 0.131
## 11 raw_data 8 0.242 0.430 0.0662 0.262 0.215
## 12 raw_data 12 0.238 0.437 0.0536 0.272 0.184
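A summary like the table above can be computed from the per-trial outcome rates; the following is a minimal sketch, where the data frame `df` and its per-trial rate columns (`tp`, `tn`, `fp`, `fn`) are assumed names based on the output:

```r
library(dplyr)

# Sketch: mean TP/TN/FP/FN rates and FDR per condition x nregions.
# `df` and its per-trial rate columns are assumptions, not the real names.
rates_by_condition <- df %>%
  group_by(condition, nregions) %>%
  summarise(
    tp = mean(tp),
    tn = mean(tn),
    fp = mean(fp),
    fn = mean(fn),
    .groups = "drop"
  ) %>%
  mutate(fdr = fp / (fp + tp))  # FDR = FP / (FP + TP)
```

As a spot check, the first row of the table above is consistent with this definition: 0.0795 / (0.0795 + 0.273) ≈ 0.225.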
We explore whether the number of regions selected by participants differs when the number of testable hypotheses changes (i.e., 8 or 12). We want to make sure that participants do not, on average, resort to a strategy of selecting a fixed number of regions regardless of the number of testable hypotheses.
In the following graph we show the mean values of TP, TN, FP and FN in each uncertainty visualization condition, separated by the number of graphs that we show to each participant.
## # A tibble: 20 × 4
## ntrials method .category .value
## <int> <chr> <chr> <dbl>
## 1 8 bh TP 0.204
## 2 8 bh TN 0.479
## 3 8 bh FP 0.0179
## 4 8 bh FN 0.3
## 5 8 bh FDR 0.0353
## 6 8 uncorrected TP 0.336
## 7 8 uncorrected TN 0.446
## 8 8 uncorrected FP 0.05
## 9 8 uncorrected FN 0.168
## 10 8 uncorrected FDR 0.0934
## 11 12 bh TP 0.219
## 12 12 bh TN 0.483
## 13 12 bh FP 0.00714
## 14 12 bh FN 0.290
## 15 12 bh FDR 0.0164
## 16 12 uncorrected TP 0.329
## 17 12 uncorrected TN 0.469
## 18 12 uncorrected FP 0.0214
## 19 12 uncorrected FN 0.181
## 20 12 uncorrected FDR 0.0321
The goal of our modeling is to estimate the probability of a TP / TN / FP / FN for a given (or average) trial, along with the associated uncertainty. Based on the results from our model, we attempt to answer our research questions.
We define the model and create the appropriate outcome column (y) in the data structure for predicting multinomial outcome variables (brms requires the outcome variable to be an n \(\times\) k matrix, where k is the number of categories and n is the number of observations; here, # of trials \(\times\) # of participants).
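A sketch of the corresponding brms call, with the formula taken from the model summary (the count column names `n_tp`, `n_tn`, `n_fn`, `n_fp` and the object name `fit` are assumptions):

```r
# Sketch: bind the four per-trial counts into the n x k outcome matrix
# that brms expects, then fit the multinomial model. Count column names
# are assumed; the formula matches the fitted model's summary.
df$y <- with(df, cbind(tp = n_tp, tn = n_tn, fn = n_fn, fp = n_fp))

fit <- brm(
  y | trials(ntrials) ~ condition * adj_trial_id * nregions +
    (adj_trial_id * nregions | prolific_pid),
  family = multinomial(),
  data = df
)
```

The first column of `y` serves as the reference category, which is why the summary reports linear predictors `mutn`, `mufn` and `mufp`, all on the logit scale relative to TP.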
## Family: multinomial
## Links: mutn = logit; mufn = logit; mufp = logit
## Formula: y | trials(ntrials) ~ condition * adj_trial_id * nregions + (adj_trial_id * nregions | prolific_pid)
## Data: df (Number of observations: 24920)
## Samples: 4 chains, each with iter = 1250; warmup = 0; thin = 1;
## total post-warmup samples = 5000
##
## Group-Level Effects:
## ~prolific_pid (Number of levels: 356)
## Estimate Est.Error l-95% CI
## sd(mutn_Intercept) 0.56 0.02 0.51
## sd(mutn_adj_trial_id) 0.21 0.02 0.17
## sd(mutn_nregions12) 0.21 0.02 0.17
## sd(mutn_adj_trial_id:nregions12) 0.18 0.04 0.09
## sd(mufn_Intercept) 0.96 0.04 0.89
## sd(mufn_adj_trial_id) 0.36 0.04 0.29
## sd(mufn_nregions12) 0.44 0.03 0.39
## sd(mufn_adj_trial_id:nregions12) 0.53 0.05 0.43
## sd(mufp_Intercept) 0.83 0.04 0.75
## sd(mufp_adj_trial_id) 0.50 0.06 0.39
## sd(mufp_nregions12) 0.47 0.04 0.40
## sd(mufp_adj_trial_id:nregions12) 1.07 0.09 0.90
## cor(mutn_Intercept,mutn_adj_trial_id) 0.22 0.11 0.00
## cor(mutn_Intercept,mutn_nregions12) -0.08 0.09 -0.25
## cor(mutn_adj_trial_id,mutn_nregions12) -0.04 0.14 -0.31
## cor(mutn_Intercept,mutn_adj_trial_id:nregions12) 0.19 0.17 -0.15
## cor(mutn_adj_trial_id,mutn_adj_trial_id:nregions12) 0.20 0.20 -0.17
## cor(mutn_nregions12,mutn_adj_trial_id:nregions12) 0.65 0.15 0.31
## cor(mufn_Intercept,mufn_adj_trial_id) 0.11 0.10 -0.09
## cor(mufn_Intercept,mufn_nregions12) -0.05 0.07 -0.19
## cor(mufn_adj_trial_id,mufn_nregions12) -0.18 0.10 -0.38
## cor(mufn_Intercept,mufn_adj_trial_id:nregions12) 0.06 0.10 -0.13
## cor(mufn_adj_trial_id,mufn_adj_trial_id:nregions12) -0.07 0.14 -0.32
## cor(mufn_nregions12,mufn_adj_trial_id:nregions12) 0.67 0.08 0.49
## cor(mufp_Intercept,mufp_adj_trial_id) 0.38 0.10 0.17
## cor(mufp_Intercept,mufp_nregions12) -0.02 0.10 -0.21
## cor(mufp_adj_trial_id,mufp_nregions12) -0.19 0.12 -0.42
## cor(mufp_Intercept,mufp_adj_trial_id:nregions12) -0.15 0.10 -0.34
## cor(mufp_adj_trial_id,mufp_adj_trial_id:nregions12) -0.63 0.08 -0.77
## cor(mufp_nregions12,mufp_adj_trial_id:nregions12) 0.50 0.10 0.29
## u-95% CI Rhat Bulk_ESS
## sd(mutn_Intercept) 0.60 1.00 3328
## sd(mutn_adj_trial_id) 0.26 1.00 3253
## sd(mutn_nregions12) 0.26 1.00 2473
## sd(mutn_adj_trial_id:nregions12) 0.26 1.00 1483
## sd(mufn_Intercept) 1.04 1.00 3140
## sd(mufn_adj_trial_id) 0.44 1.00 3478
## sd(mufn_nregions12) 0.50 1.00 2692
## sd(mufn_adj_trial_id:nregions12) 0.64 1.00 2175
## sd(mufp_Intercept) 0.91 1.00 4519
## sd(mufp_adj_trial_id) 0.61 1.00 3925
## sd(mufp_nregions12) 0.56 1.00 3751
## sd(mufp_adj_trial_id:nregions12) 1.26 1.00 2758
## cor(mutn_Intercept,mutn_adj_trial_id) 0.42 1.00 4618
## cor(mutn_Intercept,mutn_nregions12) 0.11 1.00 4078
## cor(mutn_adj_trial_id,mutn_nregions12) 0.24 1.00 1680
## cor(mutn_Intercept,mutn_adj_trial_id:nregions12) 0.51 1.00 4445
## cor(mutn_adj_trial_id,mutn_adj_trial_id:nregions12) 0.60 1.00 2979
## cor(mutn_nregions12,mutn_adj_trial_id:nregions12) 0.87 1.00 2731
## cor(mufn_Intercept,mufn_adj_trial_id) 0.30 1.00 4296
## cor(mufn_Intercept,mufn_nregions12) 0.09 1.00 4345
## cor(mufn_adj_trial_id,mufn_nregions12) 0.02 1.00 1901
## cor(mufn_Intercept,mufn_adj_trial_id:nregions12) 0.25 1.00 4791
## cor(mufn_adj_trial_id,mufn_adj_trial_id:nregions12) 0.22 1.00 2478
## cor(mufn_nregions12,mufn_adj_trial_id:nregions12) 0.81 1.00 2536
## cor(mufp_Intercept,mufp_adj_trial_id) 0.57 1.00 4119
## cor(mufp_Intercept,mufp_nregions12) 0.18 1.00 4574
## cor(mufp_adj_trial_id,mufp_nregions12) 0.05 1.00 3059
## cor(mufp_Intercept,mufp_adj_trial_id:nregions12) 0.03 1.00 4325
## cor(mufp_adj_trial_id,mufp_adj_trial_id:nregions12) -0.46 1.00 2294
## cor(mufp_nregions12,mufp_adj_trial_id:nregions12) 0.68 1.00 2405
## Tail_ESS
## sd(mutn_Intercept) 4274
## sd(mutn_adj_trial_id) 4060
## sd(mutn_nregions12) 3574
## sd(mutn_adj_trial_id:nregions12) 2233
## sd(mufn_Intercept) 4104
## sd(mufn_adj_trial_id) 4420
## sd(mufn_nregions12) 3662
## sd(mufn_adj_trial_id:nregions12) 2779
## sd(mufp_Intercept) 4727
## sd(mufp_adj_trial_id) 4557
## sd(mufp_nregions12) 4776
## sd(mufp_adj_trial_id:nregions12) 4271
## cor(mutn_Intercept,mutn_adj_trial_id) 4750
## cor(mutn_Intercept,mutn_nregions12) 4445
## cor(mutn_adj_trial_id,mutn_nregions12) 2836
## cor(mutn_Intercept,mutn_adj_trial_id:nregions12) 4943
## cor(mutn_adj_trial_id,mutn_adj_trial_id:nregions12) 4347
## cor(mutn_nregions12,mutn_adj_trial_id:nregions12) 3174
## cor(mufn_Intercept,mufn_adj_trial_id) 4789
## cor(mufn_Intercept,mufn_nregions12) 4745
## cor(mufn_adj_trial_id,mufn_nregions12) 3405
## cor(mufn_Intercept,mufn_adj_trial_id:nregions12) 4912
## cor(mufn_adj_trial_id,mufn_adj_trial_id:nregions12) 3701
## cor(mufn_nregions12,mufn_adj_trial_id:nregions12) 3705
## cor(mufp_Intercept,mufp_adj_trial_id) 4385
## cor(mufp_Intercept,mufp_nregions12) 4617
## cor(mufp_adj_trial_id,mufp_nregions12) 4118
## cor(mufp_Intercept,mufp_adj_trial_id:nregions12) 4284
## cor(mufp_adj_trial_id,mufp_adj_trial_id:nregions12) 3811
## cor(mufp_nregions12,mufp_adj_trial_id:nregions12) 3871
##
## Population-Level Effects:
## Estimate Est.Error
## mutn_Intercept 0.49 0.07
## mufn_Intercept -0.23 0.11
## mufp_Intercept -1.78 0.11
## mutn_conditiondotplot -0.23 0.10
## mutn_conditionhalfeye -0.05 0.10
## mutn_conditionhops_bootstrap 0.45 0.10
## mutn_conditionhops_mean 0.01 0.10
## mutn_conditionraw_data 0.17 0.10
## mutn_adj_trial_id 0.12 0.05
## mutn_nregions12 0.12 0.04
## mutn_conditiondotplot:adj_trial_id -0.05 0.08
## mutn_conditionhalfeye:adj_trial_id -0.05 0.08
## mutn_conditionhops_bootstrap:adj_trial_id -0.03 0.08
## mutn_conditionhops_mean:adj_trial_id -0.01 0.08
## mutn_conditionraw_data:adj_trial_id 0.10 0.08
## mutn_conditiondotplot:nregions12 -0.03 0.06
## mutn_conditionhalfeye:nregions12 -0.04 0.06
## mutn_conditionhops_bootstrap:nregions12 0.03 0.06
## mutn_conditionhops_mean:nregions12 0.01 0.06
## mutn_conditionraw_data:nregions12 -0.06 0.06
## mutn_adj_trial_id:nregions12 0.10 0.07
## mutn_conditiondotplot:adj_trial_id:nregions12 -0.11 0.11
## mutn_conditionhalfeye:adj_trial_id:nregions12 -0.03 0.11
## mutn_conditionhops_bootstrap:adj_trial_id:nregions12 0.21 0.11
## mutn_conditionhops_mean:adj_trial_id:nregions12 0.13 0.11
## mutn_conditionraw_data:adj_trial_id:nregions12 0.00 0.11
## mufn_conditiondotplot -0.48 0.16
## mufn_conditionhalfeye -0.19 0.16
## mufn_conditionhops_bootstrap 0.77 0.16
## mufn_conditionhops_mean -0.06 0.16
## mufn_conditionraw_data 0.30 0.16
## mufn_adj_trial_id 0.23 0.07
## mufn_nregions12 0.10 0.07
## mufn_conditiondotplot:adj_trial_id 0.06 0.11
## mufn_conditionhalfeye:adj_trial_id -0.15 0.11
## mufn_conditionhops_bootstrap:adj_trial_id -0.04 0.11
## mufn_conditionhops_mean:adj_trial_id -0.02 0.11
## mufn_conditionraw_data:adj_trial_id 0.17 0.10
## mufn_conditiondotplot:nregions12 0.01 0.10
## mufn_conditionhalfeye:nregions12 0.03 0.10
## mufn_conditionhops_bootstrap:nregions12 0.14 0.10
## mufn_conditionhops_mean:nregions12 0.11 0.10
## mufn_conditionraw_data:nregions12 0.04 0.10
## mufn_adj_trial_id:nregions12 0.04 0.10
## mufn_conditiondotplot:adj_trial_id:nregions12 -0.13 0.16
## mufn_conditionhalfeye:adj_trial_id:nregions12 0.06 0.15
## mufn_conditionhops_bootstrap:adj_trial_id:nregions12 0.34 0.15
## mufn_conditionhops_mean:adj_trial_id:nregions12 0.28 0.15
## mufn_conditionraw_data:adj_trial_id:nregions12 0.05 0.15
## mufp_conditiondotplot 0.09 0.15
## mufp_conditionhalfeye -0.15 0.16
## mufp_conditionhops_bootstrap 0.14 0.15
## mufp_conditionhops_mean -0.25 0.15
## mufp_conditionraw_data 0.04 0.15
## mufp_adj_trial_id -0.28 0.10
## mufp_nregions12 -0.45 0.09
## mufp_conditiondotplot:adj_trial_id 0.15 0.13
## mufp_conditionhalfeye:adj_trial_id -0.11 0.14
## mufp_conditionhops_bootstrap:adj_trial_id -0.09 0.14
## mufp_conditionhops_mean:adj_trial_id -0.13 0.14
## mufp_conditionraw_data:adj_trial_id -0.17 0.14
## mufp_conditiondotplot:nregions12 0.07 0.12
## mufp_conditionhalfeye:nregions12 -0.01 0.12
## mufp_conditionhops_bootstrap:nregions12 0.09 0.13
## mufp_conditionhops_mean:nregions12 -0.02 0.13
## mufp_conditionraw_data:nregions12 0.16 0.13
## mufp_adj_trial_id:nregions12 -0.02 0.16
## mufp_conditiondotplot:adj_trial_id:nregions12 -0.23 0.23
## mufp_conditionhalfeye:adj_trial_id:nregions12 0.07 0.23
## mufp_conditionhops_bootstrap:adj_trial_id:nregions12 0.04 0.24
## mufp_conditionhops_mean:adj_trial_id:nregions12 -0.03 0.24
## mufp_conditionraw_data:adj_trial_id:nregions12 0.13 0.23
## l-95% CI u-95% CI Rhat
## mutn_Intercept 0.35 0.63 1.00
## mufn_Intercept -0.45 -0.01 1.00
## mufp_Intercept -1.99 -1.58 1.00
## mutn_conditiondotplot -0.42 -0.03 1.00
## mutn_conditionhalfeye -0.26 0.15 1.00
## mutn_conditionhops_bootstrap 0.24 0.65 1.00
## mutn_conditionhops_mean -0.20 0.21 1.00
## mutn_conditionraw_data -0.03 0.36 1.00
## mutn_adj_trial_id 0.01 0.23 1.00
## mutn_nregions12 0.03 0.21 1.00
## mutn_conditiondotplot:adj_trial_id -0.20 0.10 1.00
## mutn_conditionhalfeye:adj_trial_id -0.20 0.10 1.00
## mutn_conditionhops_bootstrap:adj_trial_id -0.19 0.12 1.00
## mutn_conditionhops_mean:adj_trial_id -0.16 0.14 1.00
## mutn_conditionraw_data:adj_trial_id -0.06 0.26 1.00
## mutn_conditiondotplot:nregions12 -0.15 0.09 1.00
## mutn_conditionhalfeye:nregions12 -0.16 0.08 1.00
## mutn_conditionhops_bootstrap:nregions12 -0.10 0.15 1.00
## mutn_conditionhops_mean:nregions12 -0.10 0.13 1.00
## mutn_conditionraw_data:nregions12 -0.18 0.06 1.00
## mutn_adj_trial_id:nregions12 -0.05 0.25 1.00
## mutn_conditiondotplot:adj_trial_id:nregions12 -0.32 0.09 1.00
## mutn_conditionhalfeye:adj_trial_id:nregions12 -0.24 0.18 1.00
## mutn_conditionhops_bootstrap:adj_trial_id:nregions12 -0.01 0.43 1.00
## mutn_conditionhops_mean:adj_trial_id:nregions12 -0.08 0.34 1.00
## mutn_conditionraw_data:adj_trial_id:nregions12 -0.22 0.22 1.00
## mufn_conditiondotplot -0.79 -0.16 1.00
## mufn_conditionhalfeye -0.50 0.14 1.00
## mufn_conditionhops_bootstrap 0.45 1.09 1.00
## mufn_conditionhops_mean -0.37 0.27 1.00
## mufn_conditionraw_data -0.02 0.61 1.00
## mufn_adj_trial_id 0.08 0.37 1.00
## mufn_nregions12 -0.04 0.23 1.00
## mufn_conditiondotplot:adj_trial_id -0.16 0.28 1.00
## mufn_conditionhalfeye:adj_trial_id -0.35 0.07 1.00
## mufn_conditionhops_bootstrap:adj_trial_id -0.24 0.17 1.00
## mufn_conditionhops_mean:adj_trial_id -0.22 0.18 1.00
## mufn_conditionraw_data:adj_trial_id -0.03 0.38 1.00
## mufn_conditiondotplot:nregions12 -0.18 0.21 1.00
## mufn_conditionhalfeye:nregions12 -0.17 0.22 1.00
## mufn_conditionhops_bootstrap:nregions12 -0.05 0.34 1.00
## mufn_conditionhops_mean:nregions12 -0.09 0.30 1.00
## mufn_conditionraw_data:nregions12 -0.15 0.24 1.00
## mufn_adj_trial_id:nregions12 -0.17 0.24 1.00
## mufn_conditiondotplot:adj_trial_id:nregions12 -0.44 0.18 1.00
## mufn_conditionhalfeye:adj_trial_id:nregions12 -0.24 0.37 1.00
## mufn_conditionhops_bootstrap:adj_trial_id:nregions12 0.05 0.63 1.00
## mufn_conditionhops_mean:adj_trial_id:nregions12 -0.01 0.58 1.00
## mufn_conditionraw_data:adj_trial_id:nregions12 -0.25 0.33 1.00
## mufp_conditiondotplot -0.20 0.39 1.00
## mufp_conditionhalfeye -0.45 0.16 1.00
## mufp_conditionhops_bootstrap -0.17 0.44 1.00
## mufp_conditionhops_mean -0.54 0.05 1.00
## mufp_conditionraw_data -0.26 0.34 1.00
## mufp_adj_trial_id -0.47 -0.10 1.00
## mufp_nregions12 -0.63 -0.28 1.00
## mufp_conditiondotplot:adj_trial_id -0.10 0.42 1.00
## mufp_conditionhalfeye:adj_trial_id -0.37 0.16 1.00
## mufp_conditionhops_bootstrap:adj_trial_id -0.37 0.19 1.00
## mufp_conditionhops_mean:adj_trial_id -0.42 0.15 1.00
## mufp_conditionraw_data:adj_trial_id -0.44 0.11 1.00
## mufp_conditiondotplot:nregions12 -0.17 0.31 1.00
## mufp_conditionhalfeye:nregions12 -0.26 0.23 1.00
## mufp_conditionhops_bootstrap:nregions12 -0.16 0.34 1.00
## mufp_conditionhops_mean:nregions12 -0.27 0.23 1.00
## mufp_conditionraw_data:nregions12 -0.08 0.40 1.00
## mufp_adj_trial_id:nregions12 -0.31 0.29 1.00
## mufp_conditiondotplot:adj_trial_id:nregions12 -0.68 0.22 1.00
## mufp_conditionhalfeye:adj_trial_id:nregions12 -0.37 0.53 1.00
## mufp_conditionhops_bootstrap:adj_trial_id:nregions12 -0.43 0.51 1.00
## mufp_conditionhops_mean:adj_trial_id:nregions12 -0.50 0.43 1.00
## mufp_conditionraw_data:adj_trial_id:nregions12 -0.32 0.59 1.00
## Bulk_ESS Tail_ESS
## mutn_Intercept 2246 3336
## mufn_Intercept 1712 2730
## mufp_Intercept 3406 4156
## mutn_conditiondotplot 2259 3635
## mutn_conditionhalfeye 2282 3469
## mutn_conditionhops_bootstrap 2407 3246
## mutn_conditionhops_mean 2276 3232
## mutn_conditionraw_data 2519 3258
## mutn_adj_trial_id 3857 4485
## mutn_nregions12 4418 4394
## mutn_conditiondotplot:adj_trial_id 4348 4897
## mutn_conditionhalfeye:adj_trial_id 4226 4381
## mutn_conditionhops_bootstrap:adj_trial_id 4334 4376
## mutn_conditionhops_mean:adj_trial_id 4192 4573
## mutn_conditionraw_data:adj_trial_id 4244 4788
## mutn_conditiondotplot:nregions12 4500 4465
## mutn_conditionhalfeye:nregions12 4586 4781
## mutn_conditionhops_bootstrap:nregions12 4400 4738
## mutn_conditionhops_mean:nregions12 4441 4735
## mutn_conditionraw_data:nregions12 4745 4445
## mutn_adj_trial_id:nregions12 3750 4353
## mutn_conditiondotplot:adj_trial_id:nregions12 4086 4578
## mutn_conditionhalfeye:adj_trial_id:nregions12 3796 4728
## mutn_conditionhops_bootstrap:adj_trial_id:nregions12 4208 4737
## mutn_conditionhops_mean:adj_trial_id:nregions12 4287 4764
## mutn_conditionraw_data:adj_trial_id:nregions12 3981 4793
## mufn_conditiondotplot 1891 3126
## mufn_conditionhalfeye 2035 3402
## mufn_conditionhops_bootstrap 1882 3449
## mufn_conditionhops_mean 1676 2988
## mufn_conditionraw_data 1797 3103
## mufn_adj_trial_id 3839 4733
## mufn_nregions12 4164 3854
## mufn_conditiondotplot:adj_trial_id 4385 4912
## mufn_conditionhalfeye:adj_trial_id 4035 4755
## mufn_conditionhops_bootstrap:adj_trial_id 4315 4507
## mufn_conditionhops_mean:adj_trial_id 4538 4912
## mufn_conditionraw_data:adj_trial_id 4065 4793
## mufn_conditiondotplot:nregions12 4238 4487
## mufn_conditionhalfeye:nregions12 4569 4951
## mufn_conditionhops_bootstrap:nregions12 4196 4604
## mufn_conditionhops_mean:nregions12 4175 4654
## mufn_conditionraw_data:nregions12 4340 4890
## mufn_adj_trial_id:nregions12 4099 4635
## mufn_conditiondotplot:adj_trial_id:nregions12 4338 4603
## mufn_conditionhalfeye:adj_trial_id:nregions12 4278 4823
## mufn_conditionhops_bootstrap:adj_trial_id:nregions12 4302 4747
## mufn_conditionhops_mean:adj_trial_id:nregions12 4438 4680
## mufn_conditionraw_data:adj_trial_id:nregions12 4150 4525
## mufp_conditiondotplot 3115 3809
## mufp_conditionhalfeye 3554 4413
## mufp_conditionhops_bootstrap 3857 4489
## mufp_conditionhops_mean 3440 4118
## mufp_conditionraw_data 3344 4038
## mufp_adj_trial_id 3929 4739
## mufp_nregions12 3923 4503
## mufp_conditiondotplot:adj_trial_id 4418 4735
## mufp_conditionhalfeye:adj_trial_id 4421 4828
## mufp_conditionhops_bootstrap:adj_trial_id 4554 4714
## mufp_conditionhops_mean:adj_trial_id 4543 4879
## mufp_conditionraw_data:adj_trial_id 4193 4581
## mufp_conditiondotplot:nregions12 4245 4664
## mufp_conditionhalfeye:nregions12 4033 4515
## mufp_conditionhops_bootstrap:nregions12 4353 4552
## mufp_conditionhops_mean:nregions12 4420 4549
## mufp_conditionraw_data:nregions12 4489 4733
## mufp_adj_trial_id:nregions12 4587 4805
## mufp_conditiondotplot:adj_trial_id:nregions12 4864 4734
## mufp_conditionhalfeye:adj_trial_id:nregions12 4622 4946
## mufp_conditionhops_bootstrap:adj_trial_id:nregions12 4750 5080
## mufp_conditionhops_mean:adj_trial_id:nregions12 5050 4909
## mufp_conditionraw_data:adj_trial_id:nregions12 4766 4866
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Before we show the results from the model, we first run some posterior predictive checks to make sure that the model is able to recover the actual data.
As the posterior predictive checks look good for all our primary population-level parameters, we examine the results more closely. Before we can visualise the model predictions, we need to extract posterior samples from the fitted model object:
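These two steps might look like the following schematic sketch. `pp_check()` and `add_epred_draws()` are real brms/tidybayes functions (in older tidybayes versions the latter was `add_fitted_draws()`), but the model object name and grid variables are assumptions, and a multinomial response may need category-wise handling in the check:

```r
# Graphical posterior predictive check; with brms 2.15 the number of
# replicated draws is set via `nsamples`
pp_check(fit, nsamples = 100)

# Expected-probability draws on a reference grid, marginalising out the
# participant-level effects (re_formula = NA drops group-level terms)
draws <- df %>%
  modelr::data_grid(condition, nregions, adj_trial_id = 0, ntrials = 1) %>%
  tidybayes::add_epred_draws(fit, re_formula = NA)
```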
If participants are not performing any form of multiple comparisons correction, then intuitively, on average, a participant in our study will make more false positives when presented with 12 graphs as opposed to 8 (the two within-subjects conditions). More directly, we can compare the False Discovery Rate (FDR) when \(nregions = 8\) vs. when \(nregions = 12\). If the FDR is constant or lower for \(nregions = 12\) compared to \(nregions = 8\), it implies that participants are performing some form of multiple comparisons correction.
First, we need to calculate the False Positive rate without any multiple comparisons correction:
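For reference, the uncorrected and Benjamini-Hochberg decision rules can be sketched as small helpers (names here are illustrative; `p.adjust()` is base R):

```r
# Realised false discovery rate of a rejection rule, given ground truth
# (`truth` is TRUE wherever a real effect exists, known by simulation)
fdr_of <- function(rejected, truth) {
  fp <- sum(rejected & !truth)
  tp <- sum(rejected & truth)
  if (fp + tp == 0) 0 else fp / (fp + tp)
}

# Compare the realised FDR with no correction vs. Benjamini-Hochberg
compare_fdr <- function(p, truth, alpha = 0.05) {
  c(
    uncorrected = fdr_of(p < alpha, truth),
    bh          = fdr_of(p.adjust(p, method = "BH") < alpha, truth)
  )
}
```

Applied to simulated trials where the true effects are known, this yields uncorrected and BH FDRs of the kind tabulated below.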
FDRs for combinations of method (uncorrected and B-H) and number of regions/hypotheses (8 or 12):
## # A tibble: 4 × 3
## ntrials method FDR
## <int> <chr> <dbl>
## 1 8 bh 0.0353
## 2 8 uncorrected 0.0934
## 3 12 bh 0.0164
## 4 12 uncorrected 0.0321
We compute average marginal effects (AMEs) for nregions and condition from the posterior predictions, and compare them against the simulated FDRs.
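One way to obtain such an AME with tidybayes is sketched below; the grid variables and the `"fp"` category label are assumptions, while `.category`, `.epred` and `median_qi()` are real tidybayes interfaces:

```r
library(dplyr)

# Sketch: posterior draws of the expected FP probability per nregions,
# averaged over the visualization conditions (an average marginal effect)
ame_fp <- df %>%
  modelr::data_grid(condition, nregions, adj_trial_id = 0, ntrials = 1) %>%
  tidybayes::add_epred_draws(fit, re_formula = NA) %>%
  filter(.category == "fp") %>%
  group_by(nregions, .draw) %>%
  summarise(fp_rate = mean(.epred), .groups = "drop") %>%  # average over conditions
  group_by(nregions) %>%
  tidybayes::median_qi(fp_rate)
```

The result has the same shape as the `fp_rate` summaries reported further below (median, `.lower`, `.upper`).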
In the figure below, we see that, on average, the FDR decreases for participants when presented with more graphs. This suggests that they are likely performing some form of implicit multiple comparisons correction.
## # A tibble: 2 × 7
## ntrials fp_rate .lower .upper .width .point .interval
## <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 8 0.145 0.134 0.157 0.95 median qi
## 2 12 0.103 0.0935 0.114 0.95 median qi
We see that the decrease in the False Discovery Rate is, on average, about 4 percentage points, with a 95% credible interval of [0.033, 0.051]. This amounts to almost a 30% relative reduction in the FDR.
Next, since we have different visualization conditions, we inspect whether this difference persists across all of them. From the figure below, we see that the FDR decreases consistently for participants across all the uncertainty representations, suggesting that it is likely not an artifact of particular forms of visual representation. The magnitude of this decrease also appears consistent across the different uncertainty representations.
Thus, our results suggest that participants in our experimental setup implicitly perform some form of multiple comparisons correction. Because our experimental design incentivised participants against making false discoveries, proportionate to performing an NHST at the 95% confidence level, we cannot tell whether participants would always behave this way. We believe that in the absence of such incentives, participants may not control for false positives in a similar manner, as suggested by the results of the study by Zgraggen et al.
To answer this question, we first look at the FDR across the different uncertainty representations, marginalised over the number of regions shown to participants. This gives us the aggregate effect over the two within-subjects conditions in the study.
From the figure above, we can see that, on average, uncertainty representations such as Hypothetical Outcome Plots (HOPs) of the mean difference and probability density functions of the difference reliably decrease the FDR, with observed decreases of about 4 and 3 percentage points respectively (95% CI: [-0.072, -0.005] and [-0.066, 0.002] respectively). Some other commonly used uncertainty representations, such as confidence intervals, appear to have a small but unreliable effect towards decreasing the FDR (~1.5 percentage points, 95% CI: [-0.05, 0.02]). On the other hand, certain other representations, such as dotplots of the mean difference and HOPs of bootstrapped data samples, appear to yield no improvement or even slightly worsen the FDR (the exact estimates are presented in the table below).
## # A tibble: 5 × 7
## condition fp_rate .lower .upper .width .point .interval
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 ci - raw_data -0.0150 -0.0506 0.0176 0.95 median qi
## 2 dotplot - raw_data -0.00300 -0.0404 0.0328 0.95 median qi
## 3 halfeye - raw_data -0.0311 -0.0657 0.00208 0.95 median qi
## 4 hops_bootstrap - raw_data 0.00770 -0.0307 0.0475 0.95 median qi
## 5 hops_mean - raw_data -0.0382 -0.0719 -0.00541 0.95 median qi
Are these differences consistent across the number of regions shown?
Based on the plots below, we find that the differences are fairly consistent across the within-subjects manipulation.
Before we compare the composite scores, we look at potential learning effects in our primary research questions. In repeated measures experimental designs such as ours, where we also provide participants feedback (in the first 5 trials of each block), we might expect to see some learning effect (or at least variation in responses over the course of the trials). In the figure below, we plot the change in the probability of TP/TN/FP/FN in each condition.
From the figure below, we see that our participants still perform some form of implicit multiple comparisons correction, and this persists even after accounting for the potential effect of learning.
Next we test if there were any effects of learning on the differences between the uncertainty representations. In other words, does the effect of learning dominate over the effect of the uncertainty representation?
From the figure below, we can see that the effect persists for the uncertainty representations which reduce the FDR (the probability density plot and HOPs of the mean difference), although the magnitude of the mean effect is smaller by around 1 percentage point. This indicates that certain forms of uncertainty representations are reliably better at reducing the FDR.
We first look at the probability of TP/TN/FP/FN in each uncertainty representation condition, marginalised over \(nregions\). Interestingly, there is not much difference in the probability of an average participant making a FP on an average trial, even for some of the uncertainty representations with better FDR; except for the dotplot condition, all the uncertainty representations appear comparable to the baseline (raw data) in terms of the probability of a False Positive on an average trial. The improvement in FDR instead arises largely from analysts identifying True Positives more accurately, where we see large differences. The probability of a False Negative also varies substantially across conditions, with HOPs of bootstrapped samples performing worse than the baseline and all the other uncertainty representations performing better (dotplot performing best), while there is little or no difference in the probability of a True Negative.
Because it is difficult to compare the rates of TP / TN / FP / FN across conditions directly, we use metrics developed in machine learning, such as F-scores and the Matthews correlation coefficient, to obtain a composite score. Another way of comparing the conditions is to use the payout, which serves as the incentive for participants.
F-scores are a common metric in machine learning for summarising the performance of a classifier, taking into account the number of True Positives, False Positives and False Negatives. The F-score is given by \(\text{F-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2TP}{2TP + FP + FN}\). We can use this to compare performance across the different uncertainty representation conditions.
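As a concrete check of the identity above, both forms can be written out in base R (a sketch; the function names are illustrative):

```r
# F-score as the harmonic mean of precision and recall
f_score <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

# Equivalent form written directly in terms of the confusion counts
f_score_counts <- function(tp, fp, fn) 2 * tp / (2 * tp + fp + fn)

f_score(30, 10, 20)  # 2/3, identical to f_score_counts(30, 10, 20)
```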
In this analysis, we compare the F-scores of an average participant to those of the BH procedure (which we consider optimal for this task).
First, we compare the difference in F-scores when we manipulate \(nregions\), i.e., the number of graphs presented to participants (8 or 12). We see that F-scores actually decrease when participants are presented with more graphs (perhaps because the decrease in FDR also entails a decrease in the number of True Positives).
Estimates of F-scores for each \(nregions \in \{8, 12\}\).
## # A tibble: 2 × 7
## ntrials fscore .lower .upper .width .point .interval
## <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 8 0.656 0.637 0.675 0.95 median qi
## 2 12 0.637 0.616 0.659 0.95 median qi
Visualisation of the probability distribution of F-scores for each \(nregions \in \{8, 12\}\).
Next, comparing the difference in F-scores between the uncertainty representations, we see that all of them, except HOPs of bootstrapped data samples, reliably increase F-scores. In other words, accuracy increases when these uncertainty representations are used. Interestingly, the dotplot results in the highest accuracy (an increase of almost 15 percentage points over the baseline, 95% CI: [0.08, 0.2]). Other uncertainty representations, such as probability density plots, HOPs of the mean difference and 95% confidence intervals, also improve F-scores.
The following table summarises the differences in F-scores between each condition and the baseline.
## # A tibble: 5 × 7
## condition fscore .lower .upper .width .point .interval
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 ci - raw_data 0.0704 0.00753 0.134 0.95 median qi
## 2 dotplot - raw_data 0.145 0.0826 0.207 0.95 median qi
## 3 halfeye - raw_data 0.108 0.0410 0.174 0.95 median qi
## 4 hops_bootstrap - raw_data -0.115 -0.194 -0.0403 0.95 median qi
## 5 hops_mean - raw_data 0.0753 0.00431 0.145 0.95 median qi
Matthews correlation coefficient: one common drawback of the F-score is that it does not take into account True Negatives. The Matthews correlation coefficient (MCC) is a measure proposed to address this limitation, and is calculated as \(MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}\).
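A direct transcription of this formula into base R (a sketch; the function name is illustrative):

```r
# Matthews correlation coefficient from a 2x2 confusion table;
# ranges from -1 (total disagreement) to +1 (perfect prediction)
mcc <- function(tp, tn, fp, fn) {
  num <- tp * tn - fp * fn
  den <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  if (den == 0) 0 else num / den  # conventionally 0 when a margin is empty
}

mcc(10, 10, 0, 0)  # 1: perfect classification
```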
In our study, we incentivise participants using a payout scheme, so it might be the case that participants are optimising for the incentives provided. Hence, we compare the average payout across the different conditions. On this measure, we see that participants in the probability density plot and HOPs of the mean difference conditions, on average, have higher payouts.
When \(nregions = 12\), our model estimates that there is an 81.1% and a 70% probability, respectively, that these two conditions have a positive average payout.
## # A tibble: 36 × 4
## # Groups: ntrials, condition [12]
## ntrials condition greater_than p
## <int> <fct> <chr> <dbl>
## 1 8 raw_data bh 0
## 2 8 raw_data uncorrected 0.255
## 3 8 raw_data zero 0
## 4 8 ci bh 0
## 5 8 ci uncorrected 0.282
## 6 8 ci zero 0
## 7 8 dotplot bh 0
## 8 8 dotplot uncorrected 0.0342
## 9 8 dotplot zero 0
## 10 8 halfeye bh 0
## # … with 26 more rows
## # A tibble: 2 × 3
## method ntrials payout
## <chr> <int> <dbl>
## 1 uncorrected 8 -53.7
## 2 uncorrected 12 50.3